“We, the undersigned, certify that the report submitted is our original work; all authors participated in the work in a substantive way; all authors have seen and approved the report as submitted; the text, images, illustrations, and other items included in the manuscript do not carry any infringement/plagiarism issue upon any existing copyrighted materials.”
| Team Member | Names of Signed Team Members |
|---|---|
| Contact Member | Chinchwade, Nikita |
| Team Member 2 | Goyal, Pushkar |
| Team Member 3 | Komarraju, Megha |
| Team Member 4 | Naphade, Chinmay |
| Team Member 5 | Saraiya, Parth |
| Team Member 6 | Singh, Prachi |
Airbnbs in New York City have the second highest host profits of all cities in the US, with revenues going up to $741,167,296. The average price of a New York Airbnb listing is $157. The market consists of 5 boroughs: Manhattan, Brooklyn, Queens, the Bronx and Staten Island.
Buying and maintaining properties in NYC is expensive. For a risk-averse investor, the cost of an unprofitable investment is significant and can exceed the cost of missing out on a potentially profitable opportunity. Is it possible to minimize the risk of falsely classifying a property with a low booking rate as one with a high booking rate?
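The question above amounts to keeping false positives low, that is, low-booking properties predicted as high-booking. As a minimal sketch, assuming a model that outputs a probability of a high booking rate, the false positive rate and precision at a chosen cutoff could be computed as follows. The vectors `actual` and `prob` and the helper `fp_metrics` are hypothetical and are not part of the project data.

```r
# Hypothetical labels (1 = high booking rate) and predicted probabilities
actual <- c(1, 0, 0, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.6, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8)

# False positive rate and precision at a given probability cutoff
fp_metrics <- function(actual, prob, cutoff) {
  predicted <- as.integer(prob >= cutoff)
  fp <- sum(predicted == 1 & actual == 0)  # low-booking flagged as high
  tp <- sum(predicted == 1 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(false_positive_rate = fp / (fp + tn),
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_)
}

fp_metrics(actual, prob, cutoff = 0.5)
fp_metrics(actual, prob, cutoff = 0.85)  # a stricter cutoff flags fewer properties
```

Raising the cutoff trades missed opportunities for fewer bad investments, which matches the risk-averse framing.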
We included the data preprocessing from the Kaggle Competition.
dfaTrain <- read_csv("airbnbTrain.csv")
dfaTest <- read_csv("airbnbTest.csv")
#Using string processing to filter New York data
dfaTrainNewYork <- dfaTrain %>%
filter(market == "New York")
#Using Random Control variable to filter New York data
dfaTrain<-dfaTrain %>% filter(as.integer(dfaTrain$`{randomControl}`/1000)==116)
dfaTrainDup <- dfaTrain
#They seem to have the same unique states
unique(dfaTrainNewYork$state)
[1] "NY" "NJ" "Ny" NA "ny" "New York"
unique(dfaTrain$state)
[1] "NY" "Ny" NA "NJ" "ny" "New York"
#However, dfaTrain (filtered with Random Control) has only one New Jersey zipcode
dfaTrainNJ <- dfaTrain %>% filter(state=="NJ")
unique(dfaTrainNJ$zipcode)
[1] "07093"
#dfaTrainNewYork (filtered with string processing) has many New Jersey zipcodes
dfaTrainNewYorkNJ <- dfaTrainNewYork %>% filter(state=="NJ")
unique(dfaTrainNewYorkNJ$zipcode)
[1] "07306" "07302" "07310" "07304" "07307" "07305" "07087" "07311" "07047" "07030"
[11] NA "07093" "07002" "07302-8544"
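The random-control filter above keeps rows whose control value starts with 116. A small sketch, using made-up control values, shows that the `as.integer(x / 1000)` form is equivalent to integer division and to an explicit range check; the vector `random_control` is hypothetical.

```r
# Hypothetical control values around the 116 block used above
random_control <- c(115999, 116000, 116502, 116999, 117000)

keep_a <- as.integer(random_control / 1000) == 116  # form used above
keep_b <- random_control %/% 1000 == 116            # integer division
keep_c <- random_control >= 116000 & random_control < 117000

identical(keep_a, keep_b) && identical(keep_b, keep_c)
random_control[keep_a]
```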
skim(dfaTrain)
── Data Summary ────────────────────────
Values
Name dfaTrain
Number of rows 30359
Number of columns 66
_______________________
Column type frequency:
character 31
Date 1
logical 9
numeric 25
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────
   skim_variable         n_missing complete_rate min   max empty n_unique whitespace
 1 access                    14148        0.534    1  1000     0    14370          0
 2 amenities                     0        1        2  1479     0    28141          0
 3 bed_type                      0        1        5    13     0        5          0
 4 cancellation_policy           0        1        6    27     0        6          0
 5 city                         78        0.997    2    29     0      190          0
 6 cleaning_fee               6310        0.792    5     7     0      191          0
 7 description                 648        0.979    1  1000     0    28798          0
 8 extra_people                  0        1        5     7     0       97          0
 9 host_about                12184        0.599    1 10785     0    14129         50
10 host_acceptance_rate        342        0.989    3     3     0        1          0
11 host_location               429        0.986    2   161     0     1146          0
12 host_neighbourhood         4273        0.859    4    44     0      361          0
13 host_response_rate          342        0.989    2     4     0       78          0
14 host_response_time          342        0.989    3    18     0        5          0
15 host_verifications            0        1        2   170     0      471          0
16 house_rules               11632        0.617    1  1000     0    16089          0
17 interaction               12307        0.595    1  1000     0    16117          0
18 market                       61        0.998    4    19     0       10          0
19 monthly_price             27374        0.0983   7    10     0      490          0
20 neighborhood_overview     10544        0.653    1  1000     0    17421          1
21 neighbourhood                 6        1.00     4    29     0      186          0
22 notes                     18052        0.405    1  1000     0    10712          0
23 price                         0        1        5    10     0      579          0
24 property_type                 0        1        3    22     0       34          0
25 room_type                     0        1       10    15     0        4          0
26 security_deposit          10558        0.652    5     9     0      172          0
27 space                      8561        0.718    1  1000     0    20344          0
28 state                         3        1.00     2     8     0        5          0
29 transit                   10417        0.657    1  1000     0    18074          2
30 weekly_price              26918        0.113    6    10     0      478          0
31 zipcode                     284        0.991    5    11     0      213          0
── Variable type: Date ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median     n_unique
1 host_since          342         0.989 2008-09-07 2019-12-03 2015-06-22     3591
── Variable type: logical ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable                    n_missing complete_rate   mean count
1 host_has_profile_pic                   342         0.989 0.997  TRU: 29935, FAL: 82
2 host_identity_verified                 342         0.989 0.461  FAL: 16193, TRU: 13824
3 host_is_superhost                      342         0.989 0.195  FAL: 24172, TRU: 5845
4 instant_bookable                         0         1     0.379  FAL: 18842, TRU: 11517
5 is_business_travel_ready                 0         1     0      FAL: 30359
6 is_location_exact                        0         1     0.825  TRU: 25060, FAL: 5299
7 require_guest_phone_verification         0         1     0.0232 FAL: 29655, TRU: 704
8 require_guest_profile_picture            0         1     0.0209 FAL: 29723, TRU: 636
9 requires_license                         0         1     0      FAL: 30359
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
   skim_variable               n_missing complete_rate     mean        sd      p0     p25     p50      p75       p100 hist
 1 id                                  0       1       1101021.    58300. 1000009 1050657 1100476 1151846.    1202115 ▇▇▇▇▇
 2 high_booking_rate                   0       1          0.204     0.403       0       0       0       0           1 ▇▁▁▁▂
 3 accommodates                        0       1          2.86      1.90        1       2       2       4          22 ▇▁▁▁▁
 4 availability_30                     0       1          7.10      9.49        0       0       1      13          30 ▇▂▁▁▁
 5 availability_365                    0       1        113.      136.          0       0      41     228         365 ▇▁▁▁▃
 6 availability_60                     0       1         19.3      21.5         0       0       8      39          60 ▇▁▂▂▂
 7 availability_90                     0       1         32.2      34.5         0       0      14      67          90 ▇▁▁▂▃
 8 bathrooms                          30       0.999      1.15      0.445       0       1       1       1        15.5 ▇▁▁▁▁
 9 bedrooms                           43       0.999      1.18      0.761       0       1       1       1          14 ▇▁▁▁▁
10 beds                               77       0.997      1.55      1.14        0       1       1       2          40 ▇▁▁▁▁
11 guests_included                     0       1          1.52      1.17        1       1       1       2          16 ▇▁▁▁▁
12 host_listings_count               342       0.989     17.6     113.          0       1       1       2        1767 ▇▁▁▁▁
13 latitude                            0       1         40.7       0.0552   40.5    40.7    40.7    40.8        40.9 ▁▂▇▃▁
14 longitude                           0       1        -74.0       0.0478  -74.2   -74.0   -74.0   -73.9       -73.7 ▁▁▇▂▁
15 maximum_nights                      0       1      72657.  12326035.         1      29    1000    1125  2147483647 ▇▁▁▁▁
16 minimum_nights                      0       1          7.38     21.1         1       1       2       5        1125 ▇▁▁▁▁
17 review_scores_accuracy           6673       0.780      9.61      0.860       2       9      10      10          10 ▁▁▁▁▇
18 review_scores_checkin            6684       0.780      9.74      0.744       2      10      10      10          10 ▁▁▁▁▇
19 review_scores_cleanliness        6665       0.780      9.27      1.09        2       9      10      10          10 ▁▁▁▁▇
20 review_scores_communication      6670       0.780      9.74      0.764       2      10      10      10          10 ▁▁▁▁▇
21 review_scores_location           6684       0.780      9.58      0.765       2       9      10      10          10 ▁▁▁▁▇
22 review_scores_rating             6651       0.781     93.8       8.80       20      92      96     100         100 ▁▁▁▁▇
23 review_scores_value              6686       0.780      9.38      0.936       2       9      10      10          10 ▁▁▁▁▇
24 square_feet                     30147       0.00698  679.      487.          0    322.     700     900        2400 ▆▇▂▁▁
25 {randomControl}                     0       1     116501.      289.     116000  116251  116502  116748      116999 ▇▇▇▇▇
#Splitting amenities
sampledf <- dfaTrain %>%
select(id, amenities) %>%
mutate(amenities = gsub('[{]', '', (gsub('[}]', '', amenities))))
df1 <- cSplit_e(sampledf, 'amenities', ',', type= 'character', fill=0, drop=TRUE)
names(df1) <- sub('.*_', '', names(df1))
i <- (colSums(Filter(is.numeric, df1)) > 1000)
df1 <- df1[,i]
df1 <-
df1 %>%
mutate_at(names(df1)[-1] , ~factor(.))
dfaTrain <- merge(dfaTrain, df1, by.x = "id", by.y = "id")
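The amenity step above turns one brace-delimited string per listing into one 0/1 column per amenity and keeps only the common ones. A base R sketch of the same idea on toy data, without `cSplit_e`; the data frame `listings`, the amenity names and the `min_count` threshold are hypothetical (the report uses a threshold of 1000 listings).

```r
# Hypothetical listings; amenities use the same "{a,b,c}" layout as the data
listings <- data.frame(
  id = 1:3,
  amenities = c("{Wifi,Kitchen,Heating}", "{Wifi}", "{}"),
  stringsAsFactors = FALSE
)

# Strip the braces, then split each string on commas
cleaned <- gsub("[{}]", "", listings$amenities)
tokens  <- strsplit(cleaned, ",", fixed = TRUE)

# One 0/1 column per distinct amenity
all_amenities <- sort(unique(unlist(tokens)))
indicators <- t(sapply(tokens, function(x) as.integer(all_amenities %in% x)))
colnames(indicators) <- all_amenities
amenity_df <- cbind(listings["id"], indicators)

# Keep only amenities offered by at least `min_count` listings
min_count <- 2
keep <- colSums(indicators) >= min_count
amenity_df[, c(TRUE, keep)]
```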
#Removing the '$' symbol
dfaTrain <-
dfaTrain %>%
mutate(
cleaning_fee = as.numeric(gsub('[$]', '', (gsub('[,]', '', cleaning_fee)))),
extra_people = as.numeric(gsub('[$]', '', (gsub('[,]', '', extra_people)))),
price = as.numeric(gsub('[$]', '', (gsub('[,]', '', price)))),
security_deposit =as.numeric(gsub('[$]', '', (gsub('[,]', '', security_deposit))))
)
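The four currency columns above are cleaned with the same nested `gsub()` call. As a sketch, the pattern could be wrapped in one small helper; `parse_currency` is a hypothetical name and not part of the original code.

```r
# Hypothetical helper: "$1,250.00" -> 1250
parse_currency <- function(x) {
  # Remove dollar signs and thousands separators in one pass
  as.numeric(gsub("[$,]", "", x))
}

parse_currency(c("$85.00", "$1,250.00", NA))
```

With dplyr 1.0 or later, a helper like this could be applied to all four columns at once through `mutate(across(...))`.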
#Converting percentage to numeric
dfaTrain <-
dfaTrain %>%
mutate(host_response_rate = as.numeric(gsub('[%]', '', host_response_rate))) %>%
mutate(host_response_rate = host_response_rate / 100)
#Converting discrete variables to factors
cols_to_factor <- c(
"cancellation_policy",
"host_response_time",
"property_type",
"room_type",
"host_identity_verified",
"host_is_superhost",
"instant_bookable",
"is_location_exact",
"requires_license",
"bed_type"
)
dfaTrain <-
dfaTrain %>%
mutate_at(cols_to_factor, ~factor(.))
#Recoding bed_type factor levels
dfaTrain <- dfaTrain %>%
mutate(
bed_type = recode_factor(
bed_type,
Airbed = "Not a Real Bed",
Couch = "Not a Real Bed",
Futon = "Not a Real Bed",
`Pull-out Sofa` = "Not a Real Bed"
))
#Recoding cancellation_policy factor levels
dfaTrain <- dfaTrain %>%
mutate(
cancellation_policy = recode_factor(
cancellation_policy,
luxury_no_refund = "luxury",
luxury_super_strict_125 = "luxury",
luxury_super_strict_95 = "luxury",
luxury_moderate = "luxury",
strict = "luxury"
))
#Removing NAs
dfaTrain <-
dfaTrain %>%
replace_na(list(host_response_time = "N/A", host_is_superhost = FALSE, host_identity_verified = FALSE))
# Creating other property_type
combinedPropertyType <- pull(dfaTrain %>%
group_by(property_type) %>%
tally() %>%
filter(n < 1000) %>%
arrange(n) %>%
select(property_type) %>%
mutate(property_type = as.character(property_type)), property_type)
dfaTrain <- dfaTrain %>%
mutate(property_type = fct_collapse(property_type, Other = combinedPropertyType))
#Converting text based columns into factors
dfaTrain <- dfaTrain %>%
mutate(host_about = ifelse(is.na(host_about),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(interaction = ifelse(is.na(interaction),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(neighborhood_overview = ifelse(is.na(neighborhood_overview),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(transit = ifelse(is.na(transit),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(access = ifelse(is.na(access),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(house_rules = ifelse(is.na(house_rules),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(notes = ifelse(is.na(notes),"FALSE", "TRUE"))
dfaTrain <- dfaTrain %>%
mutate(space = ifelse(is.na(space),"FALSE", "TRUE"))
cols_to_factor <- c(
"host_about",
"space",
"notes",
"house_rules",
"access",
"transit",
"neighborhood_overview",
"interaction"
)
dfaTrain <-
dfaTrain %>%
mutate_at(cols_to_factor, ~factor(.))
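The eight free-text columns above are each reduced to a flag that records whether the host wrote anything. A compact base R sketch of that step on toy data; the data frame `listings` and its values are hypothetical.

```r
# Hypothetical listings with free-text fields, some missing
listings <- data.frame(
  id = 1:3,
  notes = c("Ring the bell", NA, NA),
  transit = c(NA, "Near the L train", "Bus stop outside"),
  stringsAsFactors = FALSE
)

text_cols <- c("notes", "transit")

# Replace each text column with a TRUE/FALSE factor: was anything written?
listings[text_cols] <- lapply(listings[text_cols], function(x) {
  factor(!is.na(x), levels = c(FALSE, TRUE))
})

str(listings)
```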
#Imputing NA values
dfaTrain$bedrooms[is.na(dfaTrain$bedrooms)] =0
dfaTrain$bathrooms[is.na(dfaTrain$bathrooms)] =0
dfaTrain <- dfaTrain %>% group_by(market) %>%
mutate(cleaning_fee=ifelse(is.na(cleaning_fee),mean(cleaning_fee,na.rm=TRUE),cleaning_fee)) %>% ungroup()
#Dividing into boroughs
dfaTrain <- dfaTrain %>% mutate(borough=ifelse(substr(dfaTrain$zipcode, 1, 3) ==100 | substr(dfaTrain$zipcode, 1, 3) ==101 | substr(dfaTrain$zipcode, 1, 3) ==102, "Manhattan" , ifelse(substr(dfaTrain$zipcode, 1, 3) ==112 ,"Brooklyn",
ifelse(substr(dfaTrain$zipcode, 1, 3) ==111 | substr(dfaTrain$zipcode, 1, 3) ==113 | substr(dfaTrain$zipcode, 1, 3) ==114 | substr(dfaTrain$zipcode, 1, 3) ==110 | substr(dfaTrain$zipcode, 1, 3) ==116 ,"Queens",
ifelse(substr(dfaTrain$zipcode, 1, 3) ==103 ,"Staten Island",
ifelse(substr(dfaTrain$zipcode, 1, 3) ==104 ,"Bronx","NA"))))))
dfTrainNoZip <-dfaTrain %>% filter(is.na(zipcode))
dfTrainNoZip$borough <- dfaTrain$borough[match(dfTrainNoZip$neighbourhood, dfaTrain$neighbourhood)]
final <- merge(dfaTrain, dfTrainNoZip[ , c("id", "borough")], by = "id", all = TRUE)
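# coalesce() stands in for the custom %?% operator, which is assumed to return the first non-NA value
final$borough <- dplyr::coalesce(final$borough.x, final$borough.y)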
final <- final %>% select(-c("borough.x","borough.y"))
dfaTrain <- final %>% filter(borough!="NA")
head(dfaTrain)
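The nested `ifelse()` above maps the first three digits of the ZIP code to a borough. A named lookup vector is one way to express the same mapping more readably; this is a sketch with the same prefixes, and `prefix_to_borough` and `zip_to_borough` are hypothetical names. Unlike the original, unmatched or missing ZIP codes return a real `NA` rather than the string "NA".

```r
# ZIP prefix -> borough, using the same prefixes as the nested ifelse() above
prefix_to_borough <- c(
  "100" = "Manhattan", "101" = "Manhattan", "102" = "Manhattan",
  "103" = "Staten Island",
  "104" = "Bronx",
  "110" = "Queens", "111" = "Queens", "113" = "Queens",
  "114" = "Queens", "116" = "Queens",
  "112" = "Brooklyn"
)

zip_to_borough <- function(zipcode) {
  # Look up the 3-digit prefix; anything not in the table becomes NA
  unname(prefix_to_borough[substr(zipcode, 1, 3)])
}

# Hypothetical ZIP codes, including a New Jersey one and a missing value
zip_to_borough(c("10013", "11211", "11377", "10301", "10451", "07302", NA))
```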
Extracting Subway Data:
dfzipS <- read_csv("NYsubway1.csv")
Parsed with column specification:
cols(
Line = col_character(),
Stationname = col_character(),
latitude = col_double(),
longitude = col_double()
)
skim(dfzipS)
── Data Summary ────────────────────────
Values
Name dfzipS
Number of rows 1868
Number of columns 4
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Line                  0             1   5  17     0       36          0
2 Stationname           0             1   4  39     0      356          0
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75  p100 hist
1 latitude              0             1  40.7 0.0704  40.6  40.7  40.7  40.8  40.9 ▂▅▇▃▂
2 longitude             0             1 -73.9 0.0572 -74.0 -74.0 -74.0 -73.9 -73.8 ▇▆▃▂▁
#Dividing into boroughs
dfaTrainDup <- dfaTrainDup %>% mutate(borough=ifelse(substr(dfaTrainDup$zipcode, 1, 3) ==100 | substr(dfaTrainDup$zipcode, 1, 3) ==101 | substr(dfaTrainDup$zipcode, 1, 3) ==102, "Manhattan" , ifelse(substr(dfaTrainDup$zipcode, 1, 3) ==112 ,"Brooklyn",
ifelse(substr(dfaTrainDup$zipcode, 1, 3) ==111 | substr(dfaTrainDup$zipcode, 1, 3) ==113 | substr(dfaTrainDup$zipcode, 1, 3) ==114 | substr(dfaTrainDup$zipcode, 1, 3) ==110 | substr(dfaTrainDup$zipcode, 1, 3) ==116 ,"Queens",
ifelse(substr(dfaTrainDup$zipcode, 1, 3) ==103 ,"Staten Island",
ifelse(substr(dfaTrainDup$zipcode, 1, 3) ==104 ,"Bronx","NA"))))))
Warning messages:
1: Unknown or uninitialised column: 'borough'.
2: Unknown or uninitialised column: 'borough'.
3: Unknown or uninitialised column: 'borough'.
4: Unknown or uninitialised column: 'borough'.
5: Unknown or uninitialised column: 'borough'.
dfTrainNoZip <-dfaTrainDup %>% filter(is.na(zipcode))
dfTrainNoZip$borough <- dfaTrainDup$borough[match(dfTrainNoZip$neighbourhood, dfaTrainDup$neighbourhood)]
final <- merge(dfaTrainDup, dfTrainNoZip[ , c("id", "borough")], by = "id", all = TRUE)
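# coalesce() stands in for the custom %?% operator, which is assumed to return the first non-NA value
final$borough <- dplyr::coalesce(final$borough.x, final$borough.y)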
final <- final %>% select(-c("borough.x","borough.y"))
dfaTrainDup <- final %>% filter(borough!="NA")
dfaTrainDup <- dfaTrainDup %>% mutate(total_amenities = ifelse(nchar(amenities)>2, str_count(amenities, ',')+1, 0))
Map for Manhattan Borough:
dfM <- dfaTrainDup %>% filter(borough=='Manhattan')
library("leaflet")
#Creating Listings across NYC
leaflet(dplyr::bind_rows(dfM,dfzipS)) %>%
addTiles() %>%
addCircleMarkers(data=dfM, color = '#9D7',
opacity = 1,~dfM$longitude, ~dfM$latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0( "<br> <b> Price: </b>", dfM$price, "<br/><b> Room Type: </b>", dfM$room_type, "<br/><b> Property Type: </b>", dfM$property_type,"<br> <b> Amenities: </b>", dfM$total_amenities,"<br> <b> Review Score: </b>", dfM$review_scores_value,"<br> <b> Booking Rate: </b>", dfM$high_booking_rate
)) %>%
addCircleMarkers(data=dfzipS,~longitude, ~latitude,color = '#FA5',labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),opacity = 1,popup = paste0( "<br> <b> Station Name: </b>", dfzipS$Stationname)) %>%
setView(-74.00, 40.71, zoom = 12)
#%>%
#addProviderTiles("CartoDB.Positron")
Map for Brooklyn Borough:
dfB <- dfaTrainDup %>% filter(borough=='Brooklyn')
library("leaflet")
#Creating Listings across NYC
leaflet(dplyr::bind_rows(dfB,dfzipS)) %>%
addTiles() %>%
addCircleMarkers(data=dfB, color = '#9D7',
opacity = 1,~longitude, ~latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0( "<br> <b> Price: </b>", dfB$price, "<br/><b> Room Type: </b>", dfB$room_type, "<br/><b> Property Type: </b>", dfB$property_type,"<br> <b> Amenities: </b>", dfB$total_amenities,"<br> <b> Review Score: </b>", dfB$review_scores_value,"<br> <b> Booking Rate: </b>", dfB$high_booking_rate
)) %>%
addCircleMarkers(data=dfzipS,color = '#FA5',opacity = 1,popup = paste0( "<br> <b> Station Name: </b>", dfzipS$Stationname)) %>%
setView(-74.00, 40.71, zoom = 12)%>%
addProviderTiles("CartoDB.Positron")
Assuming "longitude" and "latitude" are longitude and latitude, respectively
Map for Queens Borough:
dfQ <- dfaTrainDup %>% filter(borough=='Queens')
library("leaflet")
#Creating Listings across NYC
leaflet(dplyr::bind_rows(dfQ,dfzipS)) %>%
addTiles() %>%
addCircleMarkers(data=dfQ, color = '#9D7',
opacity = 1,~longitude, ~latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0( "<br> <b> Price: </b>", dfQ$price, "<br/><b> Room Type: </b>", dfQ$room_type, "<br/><b> Property Type: </b>", dfQ$property_type,"<br> <b> Amenities: </b>", dfQ$total_amenities,"<br> <b> Review Score: </b>", dfQ$review_scores_value,"<br> <b> Booking Rate: </b>", dfQ$high_booking_rate
)) %>%
addCircleMarkers(data=dfzipS,color = '#FA5',opacity = 1,popup = paste0( "<br> <b> Station Name: </b>", dfzipS$Stationname)) %>%
setView(-74.00, 40.71, zoom = 12)
Assuming "longitude" and "latitude" are longitude and latitude, respectively
#%>%
#addProviderTiles("CartoDB.Positron")
Map for Bronx Borough:
dfBr <- dfaTrainDup %>% filter(borough=='Bronx')
library("leaflet")
#Creating Listings across NYC
leaflet(dplyr::bind_rows(dfBr,dfzipS)) %>%
addTiles() %>%
addCircleMarkers(data=dfBr, color = '#9D7',
opacity = 1,~longitude, ~latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0( "<br> <b> Price: </b>", dfBr$price, "<br/><b> Room Type: </b>", dfBr$room_type, "<br/><b> Property Type: </b>", dfBr$property_type,"<br> <b> Amenities: </b>", dfBr$total_amenities,"<br> <b> Review Score: </b>", dfBr$review_scores_value,"<br> <b> Booking Rate: </b>", dfBr$high_booking_rate
)) %>%
addCircleMarkers(data=dfzipS,color = '#FA5',opacity = 1,popup = paste0( "<br> <b> Station Name: </b>", dfzipS$Stationname)) %>%
setView(-74.00, 40.71, zoom = 12)
Assuming "longitude" and "latitude" are longitude and latitude, respectively
Map for Staten Island Borough:
dfS <- dfaTrainDup %>% filter(borough=='Staten Island')
library("leaflet")
#Creating Listings across NYC
leaflet(dplyr::bind_rows(dfS,dfzipS)) %>%
addTiles() %>%
addCircleMarkers(data=dfS, color = '#9D7',
opacity = 1,~longitude, ~latitude,labelOptions = labelOptions(noHide = F),clusterOptions = markerClusterOptions(),popup = paste0( "<br> <b> Price: </b>", dfS$price, "<br/><b> Room Type: </b>", dfS$room_type, "<br/><b> Property Type: </b>", dfS$property_type,"<br> <b> Amenities: </b>", dfS$total_amenities,"<br> <b> Review Score: </b>", dfS$review_scores_value,"<br> <b> Booking Rate: </b>", dfS$high_booking_rate
)) %>%
addCircleMarkers(data=dfzipS,color = '#FA5',opacity = 1,popup = paste0( "<br> <b> Station Name: </b>", dfzipS$Stationname)) %>%
setView(-74.00, 40.71, zoom = 12)
Assuming "longitude" and "latitude" are longitude and latitude, respectively
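The five borough maps above differ only in the data frame they filter. As a sketch, the shared logic could live in two small functions; `listing_popup` and `borough_map` are hypothetical names, the popup is shortened for illustration, and the leaflet package is only needed when `borough_map` is actually called.

```r
# Build the HTML popup text for each listing in a data frame
listing_popup <- function(df) {
  paste0(
    "<b>Price: </b>", df$price,
    "<br/><b>Room Type: </b>", df$room_type,
    "<br/><b>Booking Rate: </b>", df$high_booking_rate
  )
}

# Draw the listings of one borough together with the subway stations
borough_map <- function(listings, stations, borough_name) {
  df <- listings[which(listings$borough == borough_name), ]
  m <- leaflet::leaflet()
  m <- leaflet::addTiles(m)
  m <- leaflet::addCircleMarkers(
    m, lng = df$longitude, lat = df$latitude, color = "#9D7", opacity = 1,
    clusterOptions = leaflet::markerClusterOptions(), popup = listing_popup(df)
  )
  m <- leaflet::addCircleMarkers(
    m, lng = stations$longitude, lat = stations$latitude, color = "#FA5",
    opacity = 1, clusterOptions = leaflet::markerClusterOptions(),
    popup = paste0("<b>Station Name: </b>", stations$Stationname)
  )
  leaflet::setView(m, -74.00, 40.71, zoom = 12)
}

# Hypothetical one-row example of the popup text
listing_popup(data.frame(price = 120, room_type = "Private room",
                         high_booking_rate = 1))
```

A call such as `borough_map(dfaTrainDup, dfzipS, "Brooklyn")` would then stand in for one of the blocks above.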
a) Distribution of high and low booking rates across boroughs
dfaTrain %>%
group_by(borough) %>%
summarize(HighBookingYes = length(borough[as.factor(high_booking_rate) == 1]),
HighBookingNo = length(borough[as.factor(high_booking_rate) == 0])) %>%
mutate(HighBookingYesPct = HighBookingYes*100/(HighBookingYes+HighBookingNo),
HighBookingNoPct = HighBookingNo*100/(HighBookingYes+HighBookingNo)) %>%
arrange(desc(HighBookingYesPct))
Manhattan and Brooklyn have the most properties. However, this doesn’t necessarily translate into a greater percentage of high booking rates. Staten Island, with the fewest properties, has a relatively greater percentage of high booking rates.
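Because `high_booking_rate` is coded 0/1, its mean within a group is the share of high-booking listings in that group. A base R sketch on toy data makes the count versus percentage distinction concrete; the data frame `listings` and its values are hypothetical.

```r
# Hypothetical listings
listings <- data.frame(
  borough = c("Manhattan", "Manhattan", "Manhattan", "Manhattan",
              "Staten Island", "Staten Island"),
  high_booking_rate = c(1, 0, 0, 0, 1, 0)
)

# Listing count and percentage of high-booking listings per borough
n_listings <- tapply(listings$high_booking_rate, listings$borough, length)
pct_high   <- tapply(listings$high_booking_rate, listings$borough, mean) * 100

data.frame(borough = names(n_listings),
           n_listings = as.vector(n_listings),
           pct_high = as.vector(pct_high))
```

In this toy example the borough with fewer listings has the higher percentage, the same pattern described above.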
plotBoroughBooking <- ggplot(data = dfaTrain, aes(x=borough, fill=as.factor(high_booking_rate))) +
geom_bar(color='black') +xlab('Borough') + ylab('Number of Properties') +ggtitle("Distribution of booking rates broken down by boroughs")
plotBoroughBooking
Properties with high booking rates are fewer than those with low booking rates. One possible explanation is that relatively few properties offer the amenities that guests want most.
b) Distribution of prices across boroughs
plotPriceBorough <- ggplot(data = dfaTrain, aes(x=as.factor(borough), y=price)) +
geom_boxplot() + xlab('Borough') +ylab('Price')
plotPriceBorough
plotPriceBorough2 <- ggplot(data = dfaTrain, aes(x=as.factor(borough), y=price)) +
geom_boxplot(outlier.shape = NA) + xlab('Borough') +ylab('Price') + ylim(0,500)
plotPriceBorough2
Prices seem to be higher and have a wider distribution in Manhattan and Brooklyn. This might also be contributing to a relatively lower percentage of high booking rates in these boroughs.
Is there multicollinearity between price and location?
car::vif(lm(high_booking_rate ~ borough+price,data=dfaTrain))
GVIF Df GVIF^(1/(2*Df))
borough 1.017509 4 1.002172
price 1.017509 1 1.008717
Both adjusted GVIF values are close to 1, so borough and price are not strongly collinear and both can be used as predictors in the same model.
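For a single-coefficient term, the variance inflation factor equals 1 / (1 − R²), where R² comes from regressing that predictor on the remaining predictors; values near 1 mean the predictor is nearly unrelated to the others. The sketch below computes this by hand on synthetic data in base R. The variables `x1`, `x2`, `x3` and the helper `vif_manual` are hypothetical and unrelated to the Airbnb columns.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the others

# VIF of one predictor: regress it on the remaining predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))
c(x1 = vif_x1, x3 = vif_x3)  # x1 should be well above 1, x3 close to 1
```

For terms with several coefficients, such as the borough factor, `car::vif` reports a generalized VIF and its adjusted form GVIF^(1/(2*Df)), as in the output above.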
c) Impact of review scores on booking rate
#Review Scores
summary(lm(high_booking_rate~review_scores_accuracy+review_scores_checkin+review_scores_cleanliness+review_scores_communication+review_scores_location+review_scores_rating+review_scores_value,data=dfaTrain))
Call:
lm(formula = high_booking_rate ~ review_scores_accuracy + review_scores_checkin +
review_scores_cleanliness + review_scores_communication +
review_scores_location + review_scores_rating + review_scores_value,
data = dfaTrain)
Residuals:
Min 1Q Median 3Q Max
-0.7114 -0.2741 -0.2504 0.6655 0.9821
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3109355 0.0448551 -6.932 4.26e-12 ***
review_scores_accuracy 0.0421256 0.0054898 7.673 1.74e-14 ***
review_scores_checkin 0.0486050 0.0054766 8.875 < 2e-16 ***
review_scores_cleanliness 0.0445799 0.0038287 11.644 < 2e-16 ***
review_scores_communication 0.0239246 0.0057342 4.172 3.03e-05 ***
review_scores_location -0.0243961 0.0043704 -5.582 2.40e-08 ***
review_scores_rating -0.0075461 0.0006699 -11.264 < 2e-16 ***
review_scores_value -0.0010512 0.0051130 -0.206 0.837
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4349 on 23617 degrees of freedom
(6550 observations deleted due to missingness)
Multiple R-squared: 0.02145, Adjusted R-squared: 0.02116
F-statistic: 73.94 on 7 and 23617 DF, p-value: < 2.2e-16
Except for review_scores_value, all review scores are statistically significant. Is there multicollinearity between the review scores?
car:: vif(lm(high_booking_rate~review_scores_accuracy+review_scores_checkin+review_scores_cleanliness+review_scores_communication+review_scores_location+review_scores_rating+review_scores_value,data=dfaTrain))
review_scores_accuracy review_scores_checkin review_scores_cleanliness review_scores_communication
2.770427 2.057186 2.157580 2.376300
review_scores_location review_scores_rating review_scores_value
1.396017 4.310059 2.843359
Yes, there seems to be some multicollinearity between the review scores: review_scores_rating has the highest VIF (4.31). It is probably the overall rating, which would explain its overlap with the individual scores, so it is dropped first.
car:: vif(lm(high_booking_rate~review_scores_accuracy+review_scores_checkin+review_scores_cleanliness+review_scores_communication+review_scores_location+review_scores_value,data=dfaTrain))
review_scores_accuracy review_scores_checkin review_scores_cleanliness review_scores_communication
2.575793 2.049272 1.834114 2.265716
review_scores_location review_scores_value
1.381711 2.468226
car:: vif(lm(high_booking_rate~review_scores_checkin+review_scores_cleanliness+review_scores_communication+review_scores_location+review_scores_value,data=dfaTrain))
review_scores_checkin review_scores_cleanliness review_scores_communication review_scores_location
1.991415 1.714060 2.183338 1.370716
review_scores_value
2.146033
Also, review_scores_accuracy captures how accurately the listing represented the space. Part of this information could also be captured by the other review scores, so review_scores_accuracy can be removed as well.
summary(lm(high_booking_rate~review_scores_checkin+review_scores_cleanliness+review_scores_communication+review_scores_location+review_scores_value,data=dfaTrain))
Call:
lm(formula = high_booking_rate ~ review_scores_checkin + review_scores_cleanliness +
review_scores_communication + review_scores_location + review_scores_value,
data = dfaTrain)
Residuals:
Min 1Q Median 3Q Max
-0.5536 -0.2816 -0.2495 0.6831 0.9069
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.251603 0.044659 -5.634 1.78e-08 ***
review_scores_checkin 0.047531 0.005392 8.815 < 2e-16 ***
review_scores_cleanliness 0.032111 0.003420 9.390 < 2e-16 ***
review_scores_communication 0.014373 0.005501 2.613 0.00899 **
review_scores_location -0.026757 0.004344 -6.160 7.38e-10 ***
review_scores_value -0.013942 0.004450 -3.133 0.00173 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4362 on 23622 degrees of freedom
(6547 observations deleted due to missingness)
Multiple R-squared: 0.01522, Adjusted R-squared: 0.01501
F-statistic: 73.02 on 5 and 23622 DF, p-value: < 2.2e-16
d) Distribution of property types across boroughs
propertydf <- dfaTrainDup %>% group_by(borough, property_type) %>% summarize(Freq = n())
propertydf <- propertydf %>% filter(property_type %in% c("Apartment","House","Condominium","Townhouse", "Loft","Guest suite"))
totalproperty<- dfaTrainDup %>% filter(property_type %in% c("Apartment","House","Condominium","Townhouse", "Loft","Guest suite"))%>% group_by(borough) %>% summarize(sum = n())
propertyratio <- merge(propertydf, totalproperty, by="borough")
propertyratio <- propertyratio %>% mutate(ratio = Freq/sum)
ggplot(propertyratio, aes(x=borough, y=ratio, fill = property_type)) +
geom_bar(position = "dodge",stat="identity") +
# A single fill scale and a single pair of axis labels; the earlier duplicates were overridden anyway
scale_fill_manual("Property Type", values=c("#e06f69","#357b8a", "#7db5b8", "#59c6f3", "#f6c458","#00FF00")) +
scale_y_continuous(labels = scales::percent) +
ggtitle("Property Types in NYC",
subtitle = "Share of each property type by borough") +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(face = "bold", color = "grey35"),
plot.caption = element_text(color = "grey68")) +
xlab("Borough") + ylab("Percentage")
Among the property types plotted above, apartments make up the largest share of listings in the Bronx, Brooklyn, Manhattan and Queens. This could be mainly because people visit these boroughs for sightseeing or business trips. In Staten Island, on the other hand, houses are the most common type, possibly because they are less expensive than houses in other boroughs and people visit mostly for leisure.
e) Room Type Analysis
Distribution of properties in NYC by room type, along with the count and percentage of properties with a high booking rate
dfr1 <- dfaTrain %>% group_by(room_type) %>% summarise(count_highbooking=sum(high_booking_rate),total=length(high_booking_rate))%>%
mutate(Pct = count_highbooking*100/total) %>% arrange(desc(Pct))
dfr1
Distribution of Properties based on Room Type
plot1 <- ggplot(data = dfr1, aes(x=room_type, y=total,fill=room_type)) +
geom_bar(stat = "identity", width=0.5,colour="black")+labs(title = "Distribution of Properties based on Room Type",x="Room Type",y="No of Properties")
plot1
Percentage of Properties with high booking rate based on Room Type
plot2 <- ggplot(data = dfr1, aes(x=room_type, y=Pct,fill=room_type)) +
geom_bar(stat = "identity", width=0.5,colour="black")+labs(title = "Percentage of Properties with high booking rate based on Room Type",x="Room Type",y="Percentage of Properties")
plot2
Distribution of properties with high booking rate in each borough
dfr4 <- dfaTrain %>% filter(high_booking_rate==1) %>% group_by(borough,room_type)%>%
summarise(total_high_booking_rate=length(high_booking_rate))
plot3 <- ggplot(data = dfr4, aes(x=borough, y=total_high_booking_rate,fill=room_type)) +
geom_bar(stat = "identity",position="stack",colour="black")+labs(title="Distribution of properties with high booking rate in each borough",x="Borough",y="Count of high booking rate properties")
plot3
From the above visualisations, we observe that Entire Home/Apartment and Private rooms are the most popular room types offered in NYC and also the most successful ones in terms of high booking rate.
Breaking this down by neighbourhood:
dfaTrain %>% group_by(room_type,neighbourhood) %>% summarise(count_highbooking=sum(high_booking_rate),total=length(high_booking_rate))%>%
mutate(Pct = count_highbooking*100/total) %>% arrange(desc(Pct)) %>% filter(total>10)
No such clear pattern emerges at the neighbourhood level.
Count of high-booking properties and total properties for each room type in each borough (combinations with more than 10 properties)
dfr2 <- dfaTrain %>% group_by(room_type,borough) %>% summarise(count_highbooking=sum(high_booking_rate),total_properties=length(high_booking_rate))%>%
mutate(Percentage_highbooking = count_highbooking*100/total_properties) %>% arrange(desc(Percentage_highbooking)) %>% filter(total_properties>10)
dfr2
Entire home/apt and Private room stand out as the most successful room types in every borough.
Room types available in each borough, with their success rate in ascending order
dfr3 <- dfaTrain %>% group_by(borough,room_type
) %>% summarise(count_highbooking=sum(high_booking_rate),total_properties=length(high_booking_rate))%>%
mutate(Percentage_highbooking = count_highbooking*100/total_properties)%>% filter(count_highbooking>5) %>% arrange(Percentage_highbooking)
dfr3
A Shared room in Queens or Brooklyn is a bad idea, but a Shared room in Manhattan may work because of how expensive the area is. Hotel rooms in Manhattan would not be a good idea. Entire home/apt and Private room have the best percentages, but all of them remain in the range of 15%-35%.
Most popular room types per borough rank wise and visualisation of the successful listings broken down by Room types in each borough
dfr5 <- dfaTrain %>% group_by(borough,room_type
) %>% summarise(total=length(high_booking_rate)) %>% arrange(borough, desc(total)) %>% mutate(rank=rank(-total))
dfr5
Visualisation of the successful listings broken down by room type in each borough
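# Note: dfr5$total counts all listings per borough and room type, not only those with high_booking_rate == 1
plot4 <- ggplot(data = dfr5, aes(x=borough, y=total,fill=room_type)) +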
geom_bar(stat = "identity",position="dodge",color='black')+labs(x="Borough",y="Count of successful listings",title="Visualisation of the successful listings broken down by room type in each borough")
plot4
Entire home/apt and Private room have the highest count of successful listings and are also the most popular room types in most boroughs. Either they are popular because they achieve good booking rates, or the reverse. In both cases, Entire home/apt and Private room appear to be the lowest-risk options in NYC.
Some guests are social butterflies and use Airbnb to meet new people on their travels. Other guests prefer privacy and keep to themselves. Choosing an Airbnb entire home means you get the whole place to yourself. There is no sharing with hosts or other guests; it is just you and your party. The hosts do not stay with you when you reserve an entire home.
Why would a traveller prefer an entire home/apartment?
1. No hosts watching your every move.
2. There’s no interaction.
3. An entire place is great when you just want somewhere to relax. If you’ve been busy all day, sometimes the last thing you want to do is come back and have to be social, talking to strangers.
Why would a traveller prefer a Private room?
1. They are very cheap.
2. They suit people who enjoy interacting with others while still having some privacy.
3. The host knows the area and can offer good advice on things to see.
Depending on which market the investor wants to target, either option can be the basis for a strong listing.
From the above analysis, we infer that the investor should lease the property either as an Entire home/apt or as a Private room to improve the chance of a high booking rate.
f) Super Host
Airbnb awards the title of “Superhost” to a small fraction of its dependable hosts. The program is an incentive that benefits hosts, Airbnb and guests alike: the superhost gains more business in the form of higher bookings, the guest receives improved service, and Airbnb benefits from satisfied customers.
Airbnb’s site requires a host to satisfy certain requirements to be considered a superhost. We consider two parameters from the dataset and measure the performance: “Response rate” and “Ratings”, which range from 0 to 100.
The scatter plot gives a few interesting insights. Most superhosts lie in the high-rating, high-response-rate region, but a very small fraction of them have response rates below 75%, which violates the 90%+ criterion set by Airbnb. As for ratings, almost all hosts are rated 75% and above, with very few below 75%. Hosts generally need to lie in the high-rating, high-response-rate region to be considered superhosts; only a small fraction hold the title without satisfying these conditions.
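The comparison against the 90% response-rate criterion can be checked with a few lines of R. This is a minimal sketch on a small made-up data frame, not the report's own code: the column names `host_is_superhost` and `host_response_rate` (stored as text such as "95%") follow the dataset, while `toy_hosts`, `response_pct` and `min_response_pct` are illustrative names.

```r
# Toy data (made up for illustration; not taken from the Airbnb dataset)
toy_hosts <- data.frame(
  host_is_superhost  = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  host_response_rate = c("100%", "95%", "70%", "80%", NA),
  stringsAsFactors   = FALSE
)

# Convert "95%" style strings to numbers; missing values stay NA
toy_hosts$response_pct <- as.numeric(sub("%", "", toy_hosts$host_response_rate))

# Response-rate threshold mentioned in the text (90%)
min_response_pct <- 90

# Superhosts whose recorded response rate falls below the threshold
below <- subset(
  toy_hosts,
  host_is_superhost & !is.na(response_pct) & response_pct < min_response_pct
)
n_below <- nrow(below)
share_below <- n_below / sum(toy_hosts$host_is_superhost)
print(below)
```

On the real data, the same filter applied to `dfaTrain` would give the number of superhosts that sit outside the expected region of the scatter plot.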
Host Response Time
Visualisation of the successful listings broken down by the host response time
dfaTrain$host_response_time<- dfaTrain$host_response_time%>% replace_na("N/A")
# geom_bar() counts rows per category, which suits a discrete x variable
plot5 <- ggplot(data = dfaTrain %>% filter(high_booking_rate==1), aes(x=host_response_time,fill=host_response_time)) +
geom_bar(color='black')+labs(x="Response Time",y="Count of listings with high booking rate",title="Visualisation of the successful listings broken down by the host response time")
plot5
Listings whose hosts respond quickly tend to have a better booking rate. This is another factor the investor should consider after purchasing the property.
In the above plot, which shows the percentage of each property type per borough, apartments account for the largest share of observations (more than 50%) in every borough except Staten Island. This is likely because most New Yorkers (approximately 65%) own properties in buildings with more than 10 units, i.e., apartments (NYU Furman Center 2017). The same report also explains the pattern in Staten Island, which has fewer apartments and more houses: approximately 60% of its properties are houses and just over 15% are apartments.
Choroplethr Map for Location Review Rating:
library(devtools)
Loading required package: usethis
Attaching package: ‘devtools’
The following object is masked from ‘package:recipes’:
check
#install_github('arilamstein/choroplethrZip@v1.5.0')
# Accessing countys file to get region of each property
dfzip <- read_csv("NYCountyZip.csv")
Parsed with column specification:
cols(
countyname = col_character(),
region = col_double(),
zipcode = col_double()
)
dfzip$zipcode <- as.character(dfzip$zipcode)
skim(dfzip)
── Data Summary ────────────────────────
Values
Name dfzip
Number of rows 2543
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None
── Variable type: character ────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 countyname 0 1 4 12 0 62 0
2 zipcode 0 1 3 5 0 2169 0
── Variable type: numeric ────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 region 0 1 36062. 35.0 36001 36031 36061 36091 36123 ▆▅▇▅▆
Merging the New York dataset with county file:
dfaTrainDup2 <- merge(dfaTrainDup,dfzip,by.x="zipcode",by.y = "zipcode",all.x= TRUE)
dfaTrainDup <- dfaTrainDup2
Checking the columns required for plotting a choroplethr map:
data(county.regions,
package = "choroplethrMaps")
region <- county.regions %>%
filter(state.name == "new york")
region
Joining the region data for choroplethr map with our data:
plotdata <- inner_join(dfaTrainDup,
region,
by=c("region" = "region"))
Plotting the choroplethr map:
The map above represents the location review scores by county. We observe that Clinton County has the highest location review score, followed by Albany and then Delaware.
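The plotting step itself is not shown above. As a hedged sketch of how such a map could be produced: `choroplethr::county_choropleth()` expects one row per county with columns named `region` (county FIPS code) and `value`. The data frame `toy_listings`, its values, and the helper `plot_county_scores()` are assumptions made for illustration; they do not reproduce the report's map.

```r
# Toy listing-level data (made up): county FIPS code and location review score
toy_listings <- data.frame(
  region = c(36001, 36001, 36019, 36025, 36025),
  review_scores_location = c(9, 10, 10, 8, NA)
)

# One row per county; the formula interface drops rows with a missing score
county_scores <- aggregate(review_scores_location ~ region,
                           data = toy_listings, FUN = mean)
names(county_scores)[2] <- "value"
print(county_scores)

# Hypothetical helper: draws the map if choroplethr is available
plot_county_scores <- function(scores) {
  if (!requireNamespace("choroplethr", quietly = TRUE) ||
      !requireNamespace("choroplethrMaps", quietly = TRUE)) {
    stop("choroplethr and choroplethrMaps must be installed to draw the map")
  }
  choroplethr::county_choropleth(
    scores,
    title      = "Mean location review score by county",
    legend     = "Mean score",
    state_zoom = "new york"
  )
}
```

Calling `plot_county_scores(county_scores)` would return a ggplot object; with the report's data, `plotdata` would take the place of `toy_listings`.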
Finding nearest subway Station and number of stations near to the property:
dfzipS$Station = paste(dfzipS$Line, dfzipS$Stationname, sep="-")
NYCSub_Stations<-aggregate(dfzipS[, 3:4], list(dfzipS$Station), mean)
colnames(NYCSub_Stations)[1] <- "Station"
dfaTrainDup$lat_dec<-format(round(dfaTrainDup$latitude, 3), nsmall = 3)
dfaTrainDup$lon_dec<-format(round(dfaTrainDup$longitude, 3), nsmall = 3)
dfaTrainDup$lat_lon = paste(dfaTrainDup$lat_dec, dfaTrainDup$lon_dec, sep="_")
NYCBor_Short<-dfaTrainDup%>%select(lat_lon,lat_dec, lon_dec)
NYCBor_short_dup<-NYCBor_Short[!duplicated(NYCBor_Short), ]
NYCBor_short_dup$shortest_metro=NA
NYCBor_short_dup$number_less_than_1=NA
Code for finding the nearest metro station and number of metro stations within a mile:
# NYCBor_short_dup$lat_dec <- as.numeric(NYCBor_short_dup$lat_dec)
# NYCBor_short_dup$lon_dec <- as.numeric(NYCBor_short_dup$lon_dec)
#
# for (i in 1:nrow(NYCBor_short_dup)) {
#   shortest_dist <- 1000   # running minimum distance in miles
#   num_less_1 <- 0         # number of stations within one mile
#   for (j in 1:nrow(NYCSub_Stations)) {
#     # Approximate distance: Euclidean distance in degrees * 69 miles per degree
#     dlat <- NYCBor_short_dup[i, 2] - NYCSub_Stations[j, 2]
#     dlon <- NYCBor_short_dup[i, 3] - NYCSub_Stations[j, 3]
#     distance <- sqrt(dlat * dlat + dlon * dlon) * 69
#     if (distance <= 1) {
#       num_less_1 <- num_less_1 + 1
#     }
#     if (distance < shortest_dist) {
#       shortest_dist <- distance
#     }
#   }
#   NYCBor_short_dup[i, 4] <- shortest_dist
#   NYCBor_short_dup[i, 5] <- num_less_1
# }
# dfaTrainDup <- merge(dfaTrainDup,NYCBor_short_dup,by.x="lat_lon",by.y = "lat_lon",all.x= TRUE)
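The loop above multiplies a difference in degrees by 69 miles, which treats a degree of longitude as being as long as a degree of latitude. At New York's latitude a degree of longitude is closer to 52 miles, so east-west distances are overstated. A possible alternative, sketched here under assumptions, is the haversine great-circle distance. The names `haversine_miles`, `station_summary` and `toy_stations` are hypothetical, and the coordinates are made up for illustration.

```r
# Great-circle distance in miles between one point and a vector of points
haversine_miles <- function(lat1, lon1, lat2, lon2, radius_miles = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_miles * asin(pmin(1, sqrt(a)))
}

# For one property: distance to the nearest station and count within a radius
station_summary <- function(lat, lon, stations, within_miles = 1) {
  d <- haversine_miles(lat, lon, stations$lat, stations$lon)
  c(shortest_metro = min(d), number_less_than_1 = sum(d <= within_miles))
}

# Toy stations (made-up coordinates, for illustration only)
toy_stations <- data.frame(
  lat = c(40.755, 40.758, 40.700),
  lon = c(-73.987, -73.981, -73.950)
)
res <- station_summary(40.756, -73.986, toy_stations)
print(res)
```

Because `haversine_miles()` is vectorised over the station coordinates, the inner loop over stations is not needed; only one call per property remains.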
dfaTrainDup <- read_csv("Megha_NYC_Data.csv")
Parsed with column specification:
cols(
.default = col_double(),
lat_lon = col_character(),
access = col_character(),
amenities = col_character(),
bed_type = col_character(),
cancellation_policy = col_character(),
cleaning_fee = col_character(),
extra_people = col_character(),
host_about = col_character(),
host_identity_verified = col_logical(),
host_is_superhost = col_logical(),
host_neighbourhood = col_character(),
host_response_rate = col_character(),
host_response_time = col_character(),
host_since = col_date(format = ""),
host_verifications = col_character(),
house_rules = col_character(),
instant_bookable = col_logical(),
interaction = col_character(),
is_location_exact = col_logical(),
market = col_character()
# ... with 14 more columns
)
See spec(...) for full column specifications.
3 parsing failures.
row col expected actual file
2911 zipcode no trailing characters -3220 'Megha_NYC_Data.csv'
14735 zipcode no trailing characters
11249 'Megha_NYC_Data.csv'
23630 zipcode no trailing characters -3233 'Megha_NYC_Data.csv'
dfaTrainDup %>% select(id,borough,shortest_metro,number_less_than_1,high_booking_rate) %>% arrange(desc(number_less_than_1))
Bar chart of the proportion of high-booking-rate properties by number of metro stations within a mile:
dfaTrainDupPlotMetro <- dfaTrainDup %>% group_by(number_less_than_1)%>%summarize(Freq = mean(high_booking_rate)) %>%
ggplot( aes(x = number_less_than_1, y = Freq))+
geom_bar( stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Proportion of High Booking Rate by Number of metro stations within a mile") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
xlab("Count of metro stations within a mile for properties") + ylab("High Booking rate Proportion")
dfaTrainDupPlotMetro
df_total = data.frame()
for (val in 0:max(dfaTrainDup$number_less_than_1)){
x<-dfaTrainDup %>% filter(number_less_than_1<=val) %>% summarize(meanforlesser=mean(high_booking_rate))
y<- dfaTrainDup %>% filter(number_less_than_1>val) %>% summarize(meanforgreater=mean(high_booking_rate))
#print(c(val,x$mean, y$mean))
df <- data.frame(val,x,y)
df_total <- rbind(df_total,df)
}
df_total
The bar graph and the result set above show that the proportion of properties with a high booking rate decreases as the number of metro stations within a mile increases. This trend holds until the number of stations reaches 19, which appears to be an inflection point: beyond it, the booking rate increases with the number of stations within a mile. The first trend could arise because guests in those areas prefer a lower price over convenient transit. The reversal could arise because guests there value convenient transit over a lower price. Properties with many metro stations within a mile are probably concentrated in Manhattan and Brooklyn, where guests may be visiting for business and need to travel frequently within NYC.
Bar graph for count of metro stations within a mile for properties:
dfaTrainDup$stationsinamile1 <- cut(x=dfaTrainDup$number_less_than_1, breaks=seq(from=-1, to=ceiling(max(dfaTrainDup$number_less_than_1)), by = 5))
dfaTrainDupPlotMetro1 <- dfaTrainDup %>% group_by(stationsinamile1,high_booking_rate)%>%summarize(Freq = n()) %>%
ggplot( aes(x = stationsinamile1, y = Freq,fill = as.factor(high_booking_rate)))+
geom_bar( stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Count of Properties by number of metro stations within a mile") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
xlab("Number of metro stations within a mile") + ylab("Count of properties")
Factor `stationsinamile1` contains implicit NA, consider using `forcats::fct_explicit_na`
dfaTrainDupPlotMetro1
dfaTrain <- merge(dfaTrain,dfaTrainDup %>% select(id, recode2), by.x="id", by.y = "id",all.x= TRUE)
df <- dfaTrain %>% mutate(stations_in_a_mile = as.factor(recode2))
This bar graph indicates that the number of properties decreases as the number of metro stations within a mile of the property increases. The number of properties with a high booking rate decreases correspondingly.
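The "implicit NA" message printed with this chart is consistent with how the breaks are built: `seq(from = -1, to = max, by = 5)` stops at the last value not exceeding the maximum, so counts above the final break fall into no bin. The sketch below illustrates this with made-up counts; `toy_counts` and `upper` are illustrative names, and extending the upper limit is one possible remedy rather than the report's own method.

```r
toy_counts <- c(0, 3, 4, 5, 12, 19, 20, 23)   # made-up station counts

# Breaks built as in the report: -1, 4, 9, 14, 19 (stops before the maximum, 23)
breaks_short <- seq(from = -1, to = ceiling(max(toy_counts)), by = 5)
bins_short <- cut(toy_counts, breaks = breaks_short)
n_unbinned <- sum(is.na(bins_short))   # 20 and 23 fall outside every bin

# Extend the upper limit to the first break at or above the maximum
upper <- -1 + 5 * ceiling((max(toy_counts) + 1) / 5)
breaks_full <- seq(from = -1, to = upper, by = 5)
bins_full <- cut(toy_counts, breaks = breaks_full)
print(table(bins_full, useNA = "ifany"))
```

With the extended breaks every count is assigned to an interval, so the highest station counts are no longer silently grouped as missing.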
The insights about the statistically significant and non-significant variables helped us to decide what variables to include in the logistic regression model for the NYC data.
set.seed(123)
dfaTrain <- sample_frac(df, 0.75)
dfaTest <- dplyr::setdiff(df, dfaTrain)
cols_to_select <- c(
"high_booking_rate",
"\"24-hour check-in\"",
"\"Air conditioning\"",
"\"Bed linens\"",
"\"Cable TV\"",
"\"Cleaning before checkout\"",
"\"Coffee maker\"",
"\"Dishes and silverware\"",
"\"Family/kid friendly\"",
"\"Free street parking\"",
"\"Hair dryer\"",
"\"Host greets you\"",
"\"Hot water\"",
"\"Laptop friendly workspace\"",
"\"Luggage dropoff allowed\"",
"\"No stairs or steps to enter\"",
"\"Paid parking off premises\"",
"\"Pets allowed\"",
"\"Pets live on this property\"",
"\"Safety card\"",
"\"Self check-in\"",
"\"Smoke detector\"",
"Dishwasher",
"Doorman",
"Elevator",
"Gym",
"Heating",
"Internet",
"Iron",
"Kitchen",
"Microwave",
"Oven",
"TV",
"Washer",
"property_type",
"host_is_superhost",
"room_type",
"review_scores_checkin",
"review_scores_location",
"review_scores_value",
"borough",
"price",
"access",
"host_response_time",
"cleaning_fee",
"host_since",
"house_rules",
"minimum_nights",
"instant_bookable",
"stations_in_a_mile"
)
# all_of() makes it explicit that cols_to_select is a character vector of column names
dfaTrain <- dfaTrain %>%
select(all_of(cols_to_select))
dfaTrain <- dfaTrain %>%
mutate(high_booking_rate = as.factor(high_booking_rate))
glmModel <- train(high_booking_rate ~ .,
family = 'binomial',
method = "glm",
data = dfaTrain %>% drop_na())
summary(glmModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7868 -0.5913 -0.2892 0.3974 5.4598
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.787e-01 7.491e-01 0.906 0.364923
`\\`"24-hour check-in"\\`1` 5.263e-01 7.928e-02 6.638 3.18e-11 ***
`\\`"Air conditioning"\\`1` 2.781e-01 7.398e-02 3.759 0.000171 ***
`\\`"Bed linens"\\`1` -1.414e-01 5.498e-02 -2.571 0.010135 *
`\\`"Cable TV"\\`1` 3.604e-01 5.568e-02 6.474 9.55e-11 ***
`\\`"Cleaning before checkout"\\`1` -3.096e-01 1.088e-01 -2.846 0.004433 **
`\\`"Coffee maker"\\`1` 1.467e-01 6.149e-02 2.386 0.017030 *
`\\`"Dishes and silverware"\\`1` 3.292e-01 7.449e-02 4.420 9.88e-06 ***
`\\`"Family/kid friendly"\\`1` 8.732e-01 5.344e-02 16.338 < 2e-16 ***
`\\`"Free street parking"\\`1` 2.557e-01 5.535e-02 4.619 3.85e-06 ***
`\\`"Hair dryer"\\`1` 5.231e-01 6.292e-02 8.313 < 2e-16 ***
`\\`"Host greets you"\\`1` 3.344e-01 6.142e-02 5.445 5.19e-08 ***
`\\`"Hot water"\\`1` 6.741e-02 6.397e-02 1.054 0.292011
`\\`"Laptop friendly workspace"\\`1` -2.167e-01 5.284e-02 -4.100 4.13e-05 ***
`\\`"Luggage dropoff allowed"\\`1` 9.369e-02 5.610e-02 1.670 0.094918 .
`\\`"No stairs or steps to enter"\\`1` 2.505e-01 7.070e-02 3.543 0.000395 ***
`\\`"Paid parking off premises"\\`1` 1.680e-01 6.752e-02 2.487 0.012871 *
`\\`"Pets allowed"\\`1` -1.912e-01 7.335e-02 -2.607 0.009144 **
`\\`"Pets live on this property"\\`1` 3.878e-01 8.563e-02 4.529 5.91e-06 ***
`\\`"Safety card"\\`1` 1.996e-01 7.515e-02 2.657 0.007894 **
`\\`"Self check-in"\\`1` 3.866e-01 5.505e-02 7.023 2.17e-12 ***
`\\`"Smoke detector"\\`1` -2.108e-01 7.801e-02 -2.702 0.006899 **
Dishwasher1 -1.187e-01 6.645e-02 -1.786 0.074042 .
Doorman1 -3.500e-01 1.462e-01 -2.394 0.016657 *
Elevator1 -2.014e-01 6.567e-02 -3.067 0.002161 **
Gym1 -8.207e-01 1.098e-01 -7.477 7.61e-14 ***
Heating1 5.322e-01 1.400e-01 3.802 0.000144 ***
Internet1 6.854e-01 6.036e-02 11.355 < 2e-16 ***
Iron1 2.044e-01 5.861e-02 3.487 0.000489 ***
Kitchen1 -3.684e-01 7.943e-02 -4.638 3.51e-06 ***
Microwave1 1.602e-01 6.312e-02 2.538 0.011161 *
Oven1 -1.228e-01 6.893e-02 -1.782 0.074757 .
TV1 -2.291e-01 5.332e-02 -4.297 1.73e-05 ***
Washer1 -3.717e-01 5.271e-02 -7.051 1.77e-12 ***
property_typeApartment 8.946e-02 7.845e-02 1.140 0.254122
property_typeHouse 1.233e-01 1.028e-01 1.199 0.230533
property_typeTownhouse 1.114e-01 1.258e-01 0.885 0.376195
host_is_superhostTRUE 9.422e-01 4.996e-02 18.859 < 2e-16 ***
`room_typeHotel room` -4.631e-01 2.798e-01 -1.655 0.097961 .
`room_typePrivate room` -6.172e-02 5.750e-02 -1.073 0.283112
`room_typeShared room` -7.639e-01 1.583e-01 -4.825 1.40e-06 ***
review_scores_checkin 3.116e-01 5.006e-02 6.225 4.83e-10 ***
review_scores_location -6.700e-02 3.806e-02 -1.761 0.078318 .
review_scores_value -1.389e-01 3.421e-02 -4.060 4.91e-05 ***
boroughBrooklyn 1.371e-01 1.312e-01 1.046 0.295729
boroughManhattan 5.350e-01 1.350e-01 3.963 7.41e-05 ***
boroughQueens 1.399e-01 1.360e-01 1.029 0.303648
`boroughStaten Island` 9.822e-02 2.439e-01 0.403 0.687143
price -1.518e-03 2.603e-04 -5.832 5.49e-09 ***
accessTRUE 5.030e-01 4.965e-02 10.132 < 2e-16 ***
`host_response_timeN/A` -1.185e+00 2.108e-01 -5.622 1.89e-08 ***
`host_response_timewithin a day` 2.952e-01 2.117e-01 1.395 0.163104
`host_response_timewithin a few hours` 5.102e-01 2.075e-01 2.459 0.013924 *
`host_response_timewithin an hour` 7.707e-01 2.044e-01 3.770 0.000163 ***
cleaning_fee -4.171e-03 6.048e-04 -6.896 5.35e-12 ***
host_since -3.171e-04 2.827e-05 -11.219 < 2e-16 ***
house_rulesTRUE 4.599e-01 5.185e-02 8.870 < 2e-16 ***
minimum_nights -6.714e-02 3.779e-03 -17.768 < 2e-16 ***
instant_bookableTRUE 2.602e-01 4.858e-02 5.355 8.54e-08 ***
stations_in_a_mile1 3.815e-01 8.517e-02 4.479 7.50e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20355 on 17693 degrees of freedom
Residual deviance: 13520 on 17634 degrees of freedom
AIC: 13640
Number of Fisher Scoring iterations: 6
exp(coef(glm(high_booking_rate ~ ., family = "binomial", data = dfaTrain %>% drop_na())))
glm.fit: fitted probabilities numerically 0 or 1 occurred
(Intercept) `"24-hour check-in"`1 `"Air conditioning"`1
1.9713410 1.6925894 1.3205960
`"Bed linens"`1 `"Cable TV"`1 `"Cleaning before checkout"`1
0.8681817 1.4339735 0.7337471
`"Coffee maker"`1 `"Dishes and silverware"`1 `"Family/kid friendly"`1
1.1580249 1.3899111 2.3944593
`"Free street parking"`1 `"Hair dryer"`1 `"Host greets you"`1
1.2913341 1.6872255 1.3970831
`"Hot water"`1 `"Laptop friendly workspace"`1 `"Luggage dropoff allowed"`1
1.0697340 0.8052088 1.0982148
`"No stairs or steps to enter"`1 `"Paid parking off premises"`1 `"Pets allowed"`1
1.2846848 1.1828803 0.8259628
`"Pets live on this property"`1 `"Safety card"`1 `"Self check-in"`1
1.4737835 1.2209506 1.4719674
`"Smoke detector"`1 Dishwasher1 Doorman1
0.8099667 0.8880696 0.7046735
Elevator1 Gym1 Heating1
0.8175679 0.4401253 1.7026123
Internet1 Iron1 Kitchen1
1.9845323 1.2267301 0.6918304
Microwave1 Oven1 TV1
1.1737045 0.8844091 0.7952471
Washer1 property_typeApartment property_typeHouse
0.6895681 1.0935872 1.1312039
property_typeTownhouse host_is_superhostTRUE room_typeHotel room
1.1177972 2.5656131 0.6293512
room_typePrivate room room_typeShared room review_scores_checkin
0.9401467 0.4658532 1.3656097
review_scores_location review_scores_value boroughBrooklyn
0.9351941 0.8703398 1.1469955
boroughManhattan boroughQueens boroughStaten Island
1.7074202 1.1502013 1.1032031
price accessTRUE host_response_timeN/A
0.9984831 1.6536926 0.3056733
host_response_timewithin a day host_response_timewithin a few hours host_response_timewithin an hour
1.3434498 1.6655940 2.1612810
cleaning_fee host_since house_rulesTRUE
0.9958379 0.9996829 1.5839345
minimum_nights instant_bookableTRUE stations_in_a_mile1
0.9350610 1.2971510 1.4644168
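The exponentiated coefficients above are odds ratios: holding the other predictors fixed, a one-unit increase in a predictor multiplies the odds of a high booking rate by that factor. As a small worked example, the sketch below recomputes three of them from the log-odds estimates in the summary output. The vector `log_odds` is typed in by hand from that output, and the 100-dollar comparison is an added illustration.

```r
# Coefficients (log-odds) copied from the model summary
log_odds <- c(
  host_is_superhostTRUE = 9.422e-01,
  minimum_nights        = -6.714e-02,
  price                 = -1.518e-03
)

# Odds ratio and the implied percentage change in the odds per unit increase
odds_ratio <- exp(log_odds)
pct_change_in_odds <- (odds_ratio - 1) * 100
print(round(cbind(log_odds, odds_ratio, pct_change_in_odds), 4))

# For a larger step, scale the coefficient before exponentiating,
# e.g. a 100-dollar increase in price
or_price_100 <- exp(100 * log_odds[["price"]])
print(round(or_price_100, 4))
```

Read this way, superhost status is associated with odds of a high booking rate roughly 2.57 times higher, while each additional minimum night is associated with odds about 6.5% lower.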
skim(dfaTrain)
── Data Summary ────────────────────────
Values
Name dfaTrain
Number of rows 22941
Number of columns 50
_______________________
Column type frequency:
character 1
Date 1
factor 42
numeric 6
________________________
Group variables None
── Variable type: character ────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 borough 0 1 5 13 0 5 0
── Variable type: Date ────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 host_since 272 0.988 2008-09-07 2019-12-03 2015-06-15 3519
── Variable type: factor ────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 "high_booking_rate" 0 1 FALSE 2 0: 18251, 1: 4690
2 "\"24-hour check-in\"" 0 1 FALSE 2 0: 20878, 1: 2063
3 "\"Air conditioning\"" 0 1 FALSE 2 1: 19530, 0: 3411
4 "\"Bed linens\"" 0 1 FALSE 2 0: 15814, 1: 7127
5 "\"Cable TV\"" 0 1 FALSE 2 0: 17652, 1: 5289
6 "\"Cleaning before checkout\"" 0 1 FALSE 2 0: 22122, 1: 819
7 "\"Coffee maker\"" 0 1 FALSE 2 0: 16058, 1: 6883
8 "\"Dishes and silverware\"" 0 1 FALSE 2 0: 14365, 1: 8576
9 "\"Family/kid friendly\"" 0 1 FALSE 2 0: 17306, 1: 5635
10 "\"Free street parking\"" 0 1 FALSE 2 0: 15292, 1: 7649
11 "\"Hair dryer\"" 0 1 FALSE 2 1: 15193, 0: 7748
12 "\"Host greets you\"" 0 1 FALSE 2 0: 19506, 1: 3435
13 "\"Hot water\"" 0 1 FALSE 2 1: 12305, 0: 10636
14 "\"Laptop friendly workspace\"" 0 1 FALSE 2 1: 14619, 0: 8322
15 "\"Luggage dropoff allowed\"" 0 1 FALSE 2 0: 19088, 1: 3853
16 "\"No stairs or steps to enter\"" 0 1 FALSE 2 0: 21042, 1: 1899
17 "\"Paid parking off premises\"" 0 1 FALSE 2 0: 20751, 1: 2190
18 "\"Pets allowed\"" 0 1 FALSE 2 0: 20102, 1: 2839
19 "\"Pets live on this property\"" 0 1 FALSE 2 0: 21609, 1: 1332
20 "\"Safety card\"" 0 1 FALSE 2 0: 21135, 1: 1806
21 "\"Self check-in\"" 0 1 FALSE 2 0: 17856, 1: 5085
22 "\"Smoke detector\"" 0 1 FALSE 2 1: 19979, 0: 2962
23 "Dishwasher" 0 1 FALSE 2 0: 19609, 1: 3332
24 "Doorman" 0 1 FALSE 2 0: 21970, 1: 971
25 "Elevator" 0 1 FALSE 2 0: 16725, 1: 6216
26 "Gym" 0 1 FALSE 2 0: 20751, 1: 2190
27 "Heating" 0 1 FALSE 2 1: 21485, 0: 1456
28 "Internet" 0 1 FALSE 2 0: 16462, 1: 6479
29 "Iron" 0 1 FALSE 2 1: 14298, 0: 8643
30 "Kitchen" 0 1 FALSE 2 1: 20814, 0: 2127
31 "Microwave" 0 1 FALSE 2 0: 15247, 1: 7694
32 "Oven" 0 1 FALSE 2 0: 15490, 1: 7451
33 "TV" 0 1 FALSE 2 1: 15623, 0: 7318
34 "Washer" 0 1 FALSE 2 0: 13777, 1: 9164
35 "property_type" 0 1 FALSE 4 Apa: 18023, Oth: 2264, Hou: 1881, Tow: 773
36 "host_is_superhost" 0 1 FALSE 2 FAL: 18497, TRU: 4444
37 "room_type" 0 1 FALSE 4 Ent: 11732, Pri: 10455, Sha: 570, Hot: 184
38 "access" 0 1 FALSE 2 TRU: 12301, FAL: 10640
39 "host_response_time" 0 1 FALSE 5 wit: 9119, N/A: 8059, wit: 3263, wit: 2066
40 "house_rules" 0 1 FALSE 2 TRU: 14069, FAL: 8872
41 "instant_bookable" 0 1 FALSE 2 FAL: 14290, TRU: 8651
42 "stations_in_a_mile" 0 1 FALSE 2 0: 21224, 1: 1717
── Variable type: numeric ────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 review_scores_checkin 5013 0.781 9.73 0.743 2 10 10 10 10 ▁▁▁▁▇
2 review_scores_location 5014 0.781 9.58 0.759 2 9 10 10 10 ▁▁▁▁▇
3 review_scores_value 5016 0.781 9.39 0.926 2 9 10 10 10 ▁▁▁▁▇
4 price 0 1 158. 349. 0 69 105 175 10000 ▇▁▁▁▁
5 cleaning_fee 0 1 66.8 50.6 0 30 67.5 80 600 ▇▁▁▁▁
6 minimum_nights 0 1 7.27 21.0 1 1 2 5 1000 ▇▁▁▁▁
resultsGLM <-
glmModel %>%
predict(dfaTest %>% drop_na() , type = 'prob') %>%
bind_cols(dfaTest %>% drop_na() , predictedClass = .) %>%
mutate(predictedClass = as.factor(ifelse(`1` > 0.6 , 1, 0)))
resultsGLM %>%
xtabs(~predictedClass+high_booking_rate, .) %>%
confusionMatrix(positive = '1')
Confusion Matrix and Statistics
high_booking_rate
predictedClass 0 1
0 5 4
1 1 2
Accuracy : 0.5833
95% CI : (0.2767, 0.8483)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.3872
Kappa : 0.1667
Mcnemar's Test P-Value : 0.3711
Sensitivity : 0.3333
Specificity : 0.8333
Pos Pred Value : 0.6667
Neg Pred Value : 0.5556
Prevalence : 0.5000
Detection Rate : 0.1667
Detection Prevalence : 0.2500
Balanced Accuracy : 0.5833
'Positive' Class : 1
We developed our business case around avoiding the negative consequences of investing in properties that would yield low booking rates. Therefore, we wanted to minimize the False Positive Rate (FPR), which is equivalent to increasing specificity. Using a cutoff of 0.6 reduced the number of false positives. This involves a tradeoff, since decreasing false positives increases false negatives. However, for a property in New York, which can be expensive, a false positive (incorrectly classifying a property with a low booking rate as one with a high booking rate) carries the greater risk for the investor.
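The rates discussed here follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the counts printed above (rows are predicted classes, columns are actual classes); the variable names are illustrative. Note that the matrix covers only 12 test rows, so these rates carry wide uncertainty, as the reported 95% CI for accuracy (0.2767, 0.8483) indicates.

```r
# Counts from the confusion matrix (positive class = 1)
tn <- 5   # predicted 0, actual 0
fn <- 4   # predicted 0, actual 1
fp <- 1   # predicted 1, actual 0
tp <- 2   # predicted 1, actual 1

specificity <- tn / (tn + fp)   # share of low-booking properties correctly rejected
fpr         <- fp / (fp + tn)   # false positive rate = 1 - specificity
sensitivity <- tp / (tp + fn)   # share of high-booking properties found
ppv         <- tp / (tp + fp)   # share of predicted positives that are correct
npv         <- tn / (tn + fn)   # share of predicted negatives that are correct

print(round(c(specificity = specificity, fpr = fpr, sensitivity = sensitivity,
              ppv = ppv, npv = npv), 4))
```

For a risk-averse investor, the two figures to watch are the false positive rate and the positive predictive value, since both describe how often a predicted high booking rate turns out to be wrong.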
Our recommendations for achieving high booking rates come in two parts. The first covers products to purchase and factors to consider before buying the property. The second covers maintenance and marketing of the property, along with the services offered to guests during their stay. While the first part is important, we observe that more emphasis should be placed on the services offered after purchasing the property.
This kind of customized experience in turn will translate into high booking rates and increased profits.
[1] https://www.citylab.com/equity/2018/03/what-airbnb-did-to-new-york-city/552749/45ad8a941a5a
[2] https://smartasset.com/mortgage/where-do-airbnb-hosts-make-the-most-money
[3] https://ny.curbed.com/2019/12/13/21009872/nyc-home-value-2010s-manhattan-apartments